Naive Bayes Classifier: A Simple Implementation

This post is a simple code implementation of the research paper ‘Naive Bayes Classifier for Efficient Text Classification’ by Peter Norvig.
code
research paper
impact
Author

Risheek kumar B

Published

June 15, 2026

How a reverend from the 1700s gave us one of the most practical classifiers in machine learning

In 1763, an English Presbyterian minister named Thomas Bayes had his most famous work published posthumously. He’d been thinking about a simple question: if I see something happen, how should that change what I believe? More than 260 years later, his theorem powers spam filters, medical diagnoses, and recommendation engines. Let’s trace the journey from a bag of balls to a working classifier, and build everything by hand along the way.

The Bag of Balls

Before we get to spam, let’s start somewhere tangible. Imagine you have a bag with 3 red balls and 7 blue balls. You draw one ball, don’t put it back, and draw another.

Question: what’s the probability both are red?

Think about it — the first draw has a 3/10 chance. But if you drew red, now there are only 2 red balls left out of 9 total. So:

\[\frac{3}{10} \times \frac{2}{9} = \frac{1}{15}\]

This is the multiplication rule for dependent events — the second draw depends on what happened in the first. This dependency is exactly where Bayes’ insight begins.
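
The 1/15 figure can be checked by brute force. The sketch below enumerates every ordered pair of draws without replacement and compares the result with the multiplication rule; the variable names are only illustrative.

```python
from fractions import Fraction
from itertools import permutations

# 3 red and 7 blue balls; each position in the list is a distinct ball.
bag = ["red"] * 3 + ["blue"] * 7

# Every ordered pair of distinct balls is one equally likely two-draw outcome.
draws = list(permutations(range(len(bag)), 2))
both_red = [d for d in draws if bag[d[0]] == "red" and bag[d[1]] == "red"]

p_enumerated = Fraction(len(both_red), len(draws))
p_rule = Fraction(3, 10) * Fraction(2, 9)  # multiplication rule

print(p_enumerated, p_rule)  # 1/15 1/15
```

Six of the 90 ordered pairs are red-red, which reduces to the same 1/15.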

Bayes’ Theorem

Let’s build the theorem from scratch. Start with the definition of conditional probability:

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

In plain English: “out of all the times B happens, how often does A also happen?”

Rearranging: \(P(A \cap B) = P(A|B) \cdot P(B)\)

By symmetry, we can also write: \(P(A \cap B) = P(B|A) \cdot P(A)\)

Both expressions equal \(P(A \cap B)\), so set them equal and solve:

\[\boxed{P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}}\]

This is Bayes’ theorem — the equation Bayes never actually wrote in this form (that was Laplace). The terms have names:

  • Prior \(P(A)\) — what you believed before seeing evidence
  • Likelihood \(P(B|A)\) — how probable the evidence is, given your hypothesis
  • Posterior \(P(A|B)\) — your updated belief after seeing evidence

The Bayesian mindset in one sentence: start with a prior, update it with evidence.
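
As a quick sanity check, the theorem can be applied back to the bag of balls. This sketch assumes A = “the first ball is red” and B = “the second ball is red”; the function name `posterior` is made up for illustration, and \(P(B)\) is computed with the law of total probability.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_a = Fraction(3, 10)         # prior: first ball is red
p_b_given_a = Fraction(2, 9)  # likelihood: second is red, given first was red

# Evidence P(B): the first ball was either red (3/10) or blue (7/10).
p_b = Fraction(3, 10) * Fraction(2, 9) + Fraction(7, 10) * Fraction(3, 9)

print(p_b)                               # 3/10
print(posterior(p_a, p_b_given_a, p_b))  # 2/9
```

Learning that the second ball is red lowers the belief that the first was red from 3/10 to 2/9, because one red ball is now accounted for.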

Frequentists vs Bayesians: A Centuries-Old Debate

Bayes’ theorem isn’t just a formula — it represents a fundamentally different way of thinking about probability. This difference sparked one of the longest-running arguments in statistics.

The Frequentist view: Probability is about long-run frequencies. A coin has a 50% chance of heads because if you flipped it infinitely many times, half would be heads. A hypothesis (like “this drug works”) is either true or false — there’s no meaningful way to say “I’m 80% sure it works.”

The Bayesian view: Probability is a measure of belief. You can say “I’m 80% sure this drug works” — it reflects your state of knowledge, updated by evidence. Before a trial, you have a prior belief. After seeing data, you have a posterior belief.

This isn’t just philosophy — it changes how you answer practical questions:

| Question | Frequentist | Bayesian |
| --- | --- | --- |
| “Is this email spam?” | Either yes or no; I’ll use a test with 95% confidence | P(spam) = 0.88 given these words |
| “Does this drug work?” | Reject or fail to reject the null hypothesis | There’s a 93% probability it works |
| “How confident are you?” | “If I repeated this experiment 100 times…” | “Given everything I’ve seen…” |

For most of the 20th century, frequentist methods dominated (think p-values, confidence intervals, hypothesis tests). But Bayesian methods have surged in recent decades, powered by faster computers that can handle the harder calculations.

Naive Bayes sits squarely in the Bayesian camp: it starts with a prior (how common is spam?) and updates it with evidence (which words appear?). The “prior → evidence → posterior” loop is the beating heart of Bayesian thinking.

Applying Bayes to Classification

Now let’s make this practical. Suppose you’re building a spam filter for email. You want to classify a new email as spam or not spam based on the words it contains.

Using Bayes’ theorem:

\[P(\text{spam} | \text{words}) = \frac{P(\text{words} | \text{spam}) \cdot P(\text{spam})}{P(\text{words})}\]

The prior \(P(\text{spam})\) is easy — just the fraction of spam emails in your training data. But \(P(\text{words} | \text{spam})\)? That’s the probability of seeing this exact combination of words in a spam email. With a vocabulary of 10,000 words, the number of possible combinations is astronomical. You’d never have enough data to estimate it directly.

This is the wall that Bayes’ theorem alone can’t climb. We need a simplification.
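
To get a feel for the scale, here is a rough count. It makes a simplifying assumption that is not in the text above: each of the 10,000 words is only marked present or absent, ignoring repeats and word order, so the true number of combinations is even larger.

```python
vocab_size = 10_000

# Each word is either present or absent, so there are 2**vocab_size patterns.
patterns = 2 ** vocab_size
print(len(str(patterns)))  # 3011 (number of decimal digits)

# For contrast: one probability per word per class is a much smaller table.
n_classes = 2
print(vocab_size * n_classes)  # 20000
```

A number with 3,011 digits dwarfs any realistic training set, which is why estimating \(P(\text{words} | \text{spam})\) directly is hopeless.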

The Naive Assumption

Here’s the trick that makes it all work: assume every word is independent of every other word.

Instead of computing \(P(\text{"free", "money"} | \text{spam})\) as one monster probability, we break it apart:

\[P(\text{"free"} | \text{spam}) \times P(\text{"money"} | \text{spam})\]

This is obviously wrong. “Free” and “win” tend to appear together in spam. “Dear” and “friend” travel as a pair. Words aren’t independent.

But here’s the surprising twist — it doesn’t matter much for classification. Even if the individual probabilities are off, the ranking of classes usually stays correct. The spam email still scores higher than the not-spam email. Naive Bayes is a poor probability estimator but a surprisingly good classifier.

This is why it’s called “naive” — and why it works despite being naive.
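
For an email with words \(w_1, \dots, w_n\) and a class \(c\), the assumption and the decision rule it leads to can be written in general form. The symbols \(w_i\), \(c\), and \(\hat{c}\) are notation introduced here for illustration:

\[P(w_1, \dots, w_n | c) \approx \prod_{i=1}^{n} P(w_i | c)\]

\[\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i | c)\]

The denominator \(P(\text{words})\) is left out of the second line because it is identical for every class and does not affect which class scores highest.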

Hands-On: Naive Bayes by Hand

Let’s walk through a concrete example with three training emails:

| Email | Label |
| --- | --- |
| “free money” | spam |
| “free offer” | spam |
| “meeting tomorrow” | not spam |

We want to classify the new email: “free money”

Step 1: Compute priors

Out of 3 emails, 2 are spam:

\[P(\text{spam}) = \frac{2}{3}, \quad P(\text{not spam}) = \frac{1}{3}\]

Step 2: Count words per class

  • Spam words: free, money, free, offer → 4 total
  • Not spam words: meeting, tomorrow → 2 total
  • Vocabulary: {free, money, offer, meeting, tomorrow} → 5 unique words

Step 3: Compute likelihoods

\[P(\text{free} | \text{spam}) = \frac{2}{4} = \frac{1}{2}\]

\[P(\text{money} | \text{spam}) = \frac{1}{4}\]

But for not spam — “free” and “money” never appeared. Their probability is zero. Multiply by zero and the entire score vanishes. This is the zero probability problem, and it’s a dealbreaker in real systems.
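
Steps 1 to 3 can be reproduced in a few lines. This is a minimal sketch using exact fractions and no smoothing; the names `training`, `priors`, `word_counts`, and `likelihood` are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("free money", "spam"),
    ("free offer", "spam"),
    ("meeting tomorrow", "not spam"),
]

# Step 1: priors are the fraction of emails in each class.
label_counts = Counter(label for _, label in training)
priors = {c: Fraction(n, len(training)) for c, n in label_counts.items()}

# Step 2: word counts per class.
word_counts = {c: Counter() for c in label_counts}
for text, label in training:
    word_counts[label].update(text.split())

# Step 3: unsmoothed likelihood = count in class / total words in class.
def likelihood(word, label):
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word], total)

print(priors["spam"], likelihood("free", "spam"), likelihood("money", "spam"))
# 2/3 1/2 1/4

# The zero probability problem: one unseen word wipes out the whole score.
score_not_spam = (
    priors["not spam"]
    * likelihood("free", "not spam")
    * likelihood("money", "not spam")
)
print(score_not_spam)  # 0
```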

Laplace Smoothing

The fix is beautifully simple: add 1 to every word count, so no word ever has zero probability. Add the vocabulary size to the denominator to keep things normalized:

\[P(\text{word} | \text{class}) = \frac{\text{count(word in class)} + 1}{\text{total words in class} + |\text{vocab}|}\]

After smoothing:

\[P(\text{free} | \text{spam}) = \frac{2+1}{4+5} = \frac{3}{9} = \frac{1}{3}, \quad P(\text{money} | \text{spam}) = \frac{1+1}{4+5} = \frac{2}{9}\]

\[P(\text{free} | \text{not spam}) = \frac{0+1}{2+5} = \frac{1}{7}, \quad P(\text{money} | \text{not spam}) = \frac{0+1}{2+5} = \frac{1}{7}\]

No more zeros. Every word has a nonzero chance, no matter how rare.
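
The smoothed values are easy to verify. In this sketch the function name `smoothed` and its `alpha` parameter are assumptions made for illustration; `alpha=1` corresponds to the add-one formula above.

```python
from fractions import Fraction

counts = {
    "spam": {"free": 2, "money": 1, "offer": 1},
    "not spam": {"meeting": 1, "tomorrow": 1},
}
vocab = {word for words in counts.values() for word in words}

def smoothed(word, label, alpha=1):
    """Add-alpha estimate of P(word | label); alpha=1 is Laplace smoothing."""
    total = sum(counts[label].values())
    return Fraction(counts[label].get(word, 0) + alpha, total + alpha * len(vocab))

print(smoothed("free", "spam"), smoothed("money", "spam"))          # 1/3 2/9
print(smoothed("free", "not spam"), smoothed("money", "not spam"))  # 1/7 1/7

# The smoothed estimates still sum to 1 over the vocabulary.
print(sum(smoothed(w, "spam") for w in vocab))  # 1
```

The last line shows why the vocabulary size goes in the denominator: without it, the probabilities for a class would no longer add up to 1.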

Step 4: Classify “free money”

\[\text{Score}_{\text{spam}} = \frac{2}{3} \times \frac{3}{9} \times \frac{2}{9} = \frac{12}{243} \approx 0.0494\]

\[\text{Score}_{\text{not spam}} = \frac{1}{3} \times \frac{1}{7} \times \frac{1}{7} = \frac{1}{147} \approx 0.0068\]

Spam wins, by a factor of about 7 (0.0494 / 0.0068 ≈ 7.3). Notice we never divided by \(P(\text{words})\). It’s the same for both classes, so it cancels out when we compare. That’s part of why Naive Bayes is so fast: we only need the numerators.
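
With only two words the raw products are fine, but multiplying hundreds of small probabilities can underflow to zero in floating point. A common workaround, which the walkthrough above does not use, is to add logarithms instead. The sketch below assumes the smoothed likelihoods from the worked example; `log_score` is a hypothetical helper name.

```python
import math

prior = {"spam": 2 / 3, "not spam": 1 / 3}
# Smoothed likelihoods from the worked example.
likelihood = {
    "spam": {"free": 3 / 9, "money": 2 / 9},
    "not spam": {"free": 1 / 7, "money": 1 / 7},
}

def log_score(words, label):
    # A sum of logs equals the log of the product, without underflow.
    return math.log(prior[label]) + sum(math.log(likelihood[label][w]) for w in words)

email = ["free", "money"]
scores = {label: log_score(email, label) for label in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # spam

# Normalizing is only needed when an actual probability is wanted.
total = sum(math.exp(s) for s in scores.values())
p_spam = math.exp(scores["spam"]) / total
print(round(p_spam, 3))  # 0.879
```

Because the logarithm is monotonic, the class with the highest log score is the same class that wins on the raw products.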

Two Flavors: Multinomial vs Gaussian

So far we’ve been counting words — discrete values like 0, 1, 2… This is where Multinomial Naive Bayes lives. The name comes from the multinomial distribution, a generalization of the binomial: instead of coin flips (2 outcomes), you have dice rolls (many outcomes). Each word drawn from an email is like rolling a vocabulary-sized die.

But what if your features are continuous? Blood pressure readings, temperatures, pixel intensities? That’s where Gaussian Naive Bayes steps in. It assumes each feature follows a bell curve (normal distribution) within each class. Instead of counting, you compute the mean and variance from training data, then plug a new value into the Gaussian formula.

Rule of thumb: Multinomial for counts, Gaussian for measurements. The right choice depends on your data.
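
Here is a minimal single-feature sketch of the Gaussian variant. The temperature readings are invented for illustration, equal priors are assumed, and the population variance is used; a real model would multiply the densities of several features and weight them by the class priors.

```python
import math
from statistics import mean, pvariance

# Hypothetical body temperatures in degrees Celsius, invented for this example.
training = {
    "healthy": [36.5, 36.8, 37.0],
    "fever": [38.5, 39.0, 39.5],
}

# "Training" is just one mean and one variance per class.
params = {label: (mean(xs), pvariance(xs)) for label, xs in training.items()}

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x):
    # Equal priors are assumed, so the class with the larger density wins.
    return max(params, key=lambda label: gaussian_pdf(x, *params[label]))

print(predict(38.8))  # fever
print(predict(36.7))  # healthy
```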

Implementation

Let’s put it all together — first from scratch, then with sklearn.

From scratch:

spam_words = ["free", "money", "free", "offer"]
notspam_words = ["meeting", "tomorrow"]
vocab = {"free", "money", "offer", "meeting", "tomorrow"}
V = len(vocab)

def p_word_given_class(word, class_words):
    count = class_words.count(word)
    return (count + 1) / (len(class_words) + V)

p_spam = 2/3
p_notspam = 1/3

test = ["free", "money"]

score_spam = p_spam
for w in test:
    score_spam *= p_word_given_class(w, spam_words)

score_notspam = p_notspam
for w in test:
    score_notspam *= p_word_given_class(w, notspam_words)

# Normalized probability of spam
print(score_spam / (score_spam + score_notspam))  # 0.879
0.8789237668161435
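
The script above hard-codes the word lists and priors. As a rough sketch of how the same arithmetic could learn them from labeled emails, here is a small class; `TinyNaiveBayes` and its methods are hypothetical names, and skipping words that never appeared in training is a design assumption rather than something the walkthrough specifies.

```python
from collections import Counter

class TinyNaiveBayes:
    """Hypothetical helper: multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, text):
        scores = {}
        n_total = sum(self.doc_counts.values())
        for label, n_docs in self.doc_counts.items():
            total = sum(self.word_counts[label].values())
            score = n_docs / n_total  # prior
            for word in text.split():
                if word in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score *= (count + 1) / (total + len(self.vocab))
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}

model = TinyNaiveBayes().fit(
    ["free money", "free offer", "meeting tomorrow"],
    ["spam", "spam", "not spam"],
)
probs = model.predict_proba("free money")
print(round(probs["spam"], 3))  # 0.879
```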

With sklearn:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = ["free money", "free offer", "meeting tomorrow"]
labels = [1, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(emails)

model = MultinomialNB(alpha=1.0)  # alpha=1 is Laplace smoothing
model.fit(X, labels)

test = vec.transform(["free money"])
print(model.predict(test))         # [1] = spam
print(model.predict_proba(test))   # [[0.121, 0.879]]
[1]
[[0.12107623 0.87892377]]

Both approaches give the same answer — 87.9% spam probability. The sklearn version scales to millions of emails.

When to Use Naive Bayes

Naive Bayes isn’t always the best tool, but it’s often the best first tool.

Reach for it when:

  • ✅ You need a fast, interpretable baseline
  • ✅ Training data is limited (it often outperforms logistic regression when data is scarce)
  • ✅ You’re working with high-dimensional sparse data like text
  • ✅ You need real-time predictions

Watch out when:

  • ⚠️ Features are strongly correlated (it double-counts evidence)
  • ⚠️ You need well-calibrated probabilities, not just classification
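
The double-counting effect is easy to see with the toy model. This sketch treats a perfectly correlated feature as an exact duplicate of the word “free”, which is a deliberately extreme assumption; `p_spam` is an illustrative helper name.

```python
from fractions import Fraction

prior = {"spam": Fraction(2, 3), "not spam": Fraction(1, 3)}
# Smoothed P("free" | class) from the worked example.
p_free = {"spam": Fraction(1, 3), "not spam": Fraction(1, 7)}

def p_spam(copies):
    """Posterior for spam when the same evidence is counted `copies` times."""
    scores = {c: prior[c] * p_free[c] ** copies for c in prior}
    return scores["spam"] / sum(scores.values())

print(p_spam(1))  # 14/17, about 0.824
print(p_spam(2))  # 98/107, about 0.916
```

The second copy carries no new information, yet the model grows more confident. The predicted class is unchanged, which is why classification tends to survive correlated features while calibration suffers.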

Interview tip: If asked “when would you choose Naive Bayes over logistic regression?”, the key answer is small data and text. As data grows, logistic regression typically catches up and surpasses it, because it does not assume the features are independent and can down-weight correlated ones instead of double-counting them.

Thomas Bayes never imagined spam filters. But his simple idea — update your beliefs with evidence — turned out to be one of the most practical tools in machine learning.
